Data Tidying
Humboldt-Universität zu Berlin
2023-12-13
Today we will learn…
The suggested resources for this topic are
languageR_english.csv dataset (in daten folder)df_eng <- read_csv(here("daten", "languageR_english.csv")) |>
1 clean_names() |>
2 arrange(word) |>
3 rename(
4 rt_lexdec = r_tlexdec,
5 rt_naming = r_tnaming
) |>
6 select(age_subject, word, word_category, rt_lexdec, rt_naming)janitor)
word in assending order (A-Z)
r_tlexdec to rt_lexdec
r_tlexdec to rt_lexdec
Abbildung 1: A model of the data science process from Wickham et al. (2023) (all rights reserved)
readr::read_csv)dplyr package)ggplot package)tidyr package)countryyearpopulationcases| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
Abbildung 2: Image source: Wickham et al. (2023) (all rights reserved)
“Happy families are all alike; every unhappy family is unhappy in its own way.”
— Leo Tolstoy
“Tidy datasets are all alike, but every untidy dataset is untidy in its own way.”
— Hadley Wickham
There are two main advantages to working with tidy data:
R’s vectorised nature can shine
tidyverse are designed to work with tidy data (e.g., ggplot2 and dplyr)Aufgabe 1: Tidy data
Beispiel 1
rate) that contains:
cases per country per year, divided bypopulation per country per year,10000dplyr verb creates new variables? (Look back at ?@sec-data-transform.)rate with these data structures?tidyversetidyr package (from the tidyverse) has two useful functions for transposing our data:
pivot_longer(): make wide data longerpivot_wider(): make long data widerAbbildung 3: die berühmteste Verwendung des Wortes Pivot (zumindest für Millenials) (Friends)
we often need to convert between wide and long data formats to do different types of summaries or visualisation
wide data: all of the observations about one thing are in the same row
long data: each observation is on a separate row
let’s start with the most typical case: turning wide data into long data
pivot_longer()languageR_english.csv dataset (df_eng)
age_subject tells us whether an observation is from a participant in the ‘old’ or ‘young’ age groupword, length_in_letters, written_frequency, and word_category describe properties of the stimulus for a given observation (i.e., the word)rt_lexdec contains the response time for a lexical decision taskrt_naming contains the response time for a word naming taskhead(df_eng)| age_subject | word | word_category | rt_lexdec | rt_naming |
|---|---|---|---|---|
| young | ace | N | 623.61 | 456.3 |
| old | ace | N | 775.67 | 607.8 |
| young | act | V | 617.10 | 445.8 |
| old | act | V | 715.52 | 639.7 |
| young | add | V | 575.70 | 467.8 |
| old | add | V | 742.19 | 605.4 |
Abbildung 4: Our plot to be reproduced
rt_lexdec and rt_naming appear in facets
facet_wrap() takes a categorical variable as its argument, and produces a facet of each categorylexdec and naming
pivot_longer()tidyr function pivot_longer() converts a wide data table to a longer format
1df_eng_long <-
2 df_eng %>%
3 pivot_longer(
4 cols = starts_with("rt_"),
5 names_to = "response",
6 values_to = "time"
)df_eng_long, which contains…
df_eng, and then
rt_
response that contains the names from cols
time that contains the values from cols
ace and act in
| age_subject | word | rt_lexdec | rt_naming |
|---|---|---|---|
| young | ace | 623.61 | 456.3 |
| old | ace | 775.67 | 607.8 |
| young | act | 617.10 | 445.8 |
| old | act | 715.52 | 639.7 |
| age_subject | word | response | time |
|---|---|---|---|
| young | ace | rt_lexdec | 623.61 |
| young | ace | rt_naming | 456.30 |
| old | ace | rt_lexdec | 775.67 |
| old | ace | rt_naming | 607.80 |
| young | act | rt_lexdec | 617.10 |
| young | act | rt_naming | 445.80 |
| old | act | rt_lexdec | 715.52 |
| old | act | rt_naming | 639.70 |
rt_lexdecrt_namingage_subject on the x-axistime on the y-axisresponse categories in facetsAbbildung 5: Response times per age group for the lexical decision task vs. naming task
Aufgabe 1: Tidy data
Beispiel 2
Recreate Abbildung 5
pivot_wider()tidyr function pivot_wider() make datasets wider by increasing columns and reducing rows
df_eng wider
young subject’s response and the old subject’s responsepivot_wider()pivot wider takes similar arguments to pivot_longer(), with some slight differences:
id_cols (optional): identifying columns (which columns uniquely identify each observation?)names_from: what should we call the new column containing the previous column names (must be a categorical variable)?names_prefix (optional): prefix for the new column names (optional)values_from: new column valuesage_subject, and their values from rt_lexdec. The result should look like Tabelle 8.df_eng_wide <-
df_eng %>%
select(-rt_naming) |>
pivot_wider(
1 names_from = age_subject,
2 values_from = rt_lexdec,
3 names_prefix = "lexdec_"
)age_subject
rt_lexdec
lexdec_ to the beginning of the new column names
ace and act in
| age_subject | word | word_category | rt_lexdec |
|---|---|---|---|
| young | ace | N | 623.61 |
| old | ace | N | 775.67 |
| young | act | V | 617.10 |
| old | act | V | 715.52 |
| word | word_category | lexdec_young | lexdec_old |
|---|---|---|---|
| ace | N | 623.61 | 775.67 |
| act | V | 617.10 | 715.52 |
Unique values
We’ve removed rt_naming because it also has a unique value per word per age group, so not removing it means we don’t change the length of our dataset (still one row per word per age group), we just change the width and introduce NA values for lexdec_young for old subjects and NA values for lexdec_old for young subjects. If we hadn’t removed it, our first 6 rows would’ve looked like Tabelle 9. Compare this to the output in Tabelle 8, do you see the difference?
| word | word_category | rt_naming | lexdec_young | lexdec_old |
|---|---|---|---|---|
| ace | N | 456.3 | 623.61 | NA |
| ace | N | 607.8 | NA | 775.67 |
| act | V | 445.8 | 617.10 | NA |
| act | V | 639.7 | NA | 715.52 |
Today we learned…
We’ll stick with the df_eng dataset for these tasks.
pivot_wider to create rt_naming new variables: naming_old and naming_young which containg the naming response times for old and young participants, respectively. Hint: you will need to remove rt_lexdec. The resulting data frame should have 2284 observations and 6 variables.pivot_wider().Abbildung 6: Scatterplot of lexical decision task response times per word for old versus young participants
df_eng_wide dataset to produce Abbildung 6? In other words, why is df_eng_wide the appropriate structure but not df_eng_long for such a scatterplot?df_eng_long and the summarise() function we saw in the last section, reproduce the summary below:# A tibble: 2 × 3
response mean sd
<chr> <dbl> <dbl>
1 rt_lexdec 708. 115.
2 rt_naming 566. 101.
Hint: do you need to remove NAs (we saw how to do this in the last section)?
Hergestellt mit R version 4.3.0 (2023-04-21) (Already Tomorrow) und RStudioversion 2023.9.0.463 (Desert Sunflower).
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] janitor_2.2.0 here_1.0.1 lubridate_1.9.2 forcats_1.0.0
[5] stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2 readr_2.1.4
[9] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.3 generics_0.1.3 xml2_1.3.4 stringi_1.7.12
[5] hms_1.1.3 digest_0.6.33 magrittr_2.0.3 evaluate_0.21
[9] grid_4.3.0 timechange_0.2.0 fastmap_1.1.1 rprojroot_2.0.3
[13] jsonlite_1.8.7 httr_1.4.6 rvest_1.0.3 fansi_1.0.4
[17] viridisLite_0.4.2 scales_1.2.1 cli_3.6.1 rlang_1.1.1
[21] crayon_1.5.2 bit64_4.0.5 munsell_0.5.0 withr_2.5.0
[25] yaml_2.3.7 tools_4.3.0 parallel_4.3.0 tzdb_0.4.0
[29] colorspace_2.1-0 webshot_0.5.4 pacman_0.5.1 kableExtra_1.3.4
[33] png_0.1-8 vctrs_0.6.3 R6_2.5.1 magick_2.7.4
[37] lifecycle_1.0.3 snakecase_0.11.0 bit_4.0.5 vroom_1.6.3
[41] pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.4 Rcpp_1.0.11
[45] glue_1.6.2 systemfonts_1.0.4 highr_0.10 xfun_0.39
[49] tidyselect_1.2.0 rstudioapi_0.14 knitr_1.44 farver_2.1.1
[53] htmltools_0.5.5 labeling_0.4.3 svglite_2.1.1 rmarkdown_2.22
[57] compiler_4.3.0
Week 9 - Data Tidying